GPU Architecture

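A warp is a group of threads that run together, one per thread processor (typically 32 on NVIDIA hardware; AMD's equivalent, the wavefront, is 32 or 64). All threads in a warp execute the same instruction at the same time, each on its own data.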
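As a rough illustration of "same code, different data", here is a toy Python model of one warp step. It is only a sketch: `WARP_SIZE` and `run_warp` are made-up names, the loop runs sequentially where real hardware runs the lanes in parallel, and a warp size of 32 is assumed.

```python
WARP_SIZE = 32  # assumed warp size; some hardware uses 64

def run_warp(instruction, lane_inputs):
    """Toy model: apply ONE instruction to every lane's own data."""
    if len(lane_inputs) != WARP_SIZE:
        raise ValueError("a warp step needs exactly one input per lane")
    # Every lane executes the same instruction; only the data differs.
    return [instruction(value) for value in lane_inputs]

# Each lane starts with its own lane index as data, then all lanes double it.
doubled = run_warp(lambda x: 2 * x, list(range(WARP_SIZE)))
```

The point is that there is no way to hand lane 5 a different instruction than lane 6 within the same step; the warp as a whole runs one instruction at a time.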

Each core (a streaming multiprocessor, or SM, in NVIDIA terms) has a block of on-chip memory that is split between L1 cache and shared memory. Each core contains 4 processing blocks, each of which can run one warp at a time.

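A dispatched workgroup is split into warps, so a workgroup with more threads than one warp holds runs as multiple warps. All warps of a workgroup run on the same core, which is what lets them use that core's shared memory.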
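The split of a workgroup into warps is simple arithmetic, sketched below in Python. The function names are hypothetical, and the sketch assumes threads are packed into warps in linear index order with a default warp size of 32; a workgroup whose size is not a multiple of the warp size leaves some lanes of its last warp unused.

```python
def split_into_warps(workgroup_size, warp_size=32):
    """Return how many lanes are in use in each warp the workgroup occupies."""
    num_warps = (workgroup_size + warp_size - 1) // warp_size  # ceiling division
    lanes_in_use = []
    for warp_id in range(num_warps):
        first_thread = warp_id * warp_size
        lanes_in_use.append(min(warp_size, workgroup_size - first_thread))
    return lanes_in_use

def locate_thread(thread_index, warp_size=32):
    """Return (warp_id, lane) for a linear thread index within the workgroup."""
    return divmod(thread_index, warp_size)

full = split_into_warps(256)      # 8 warps, all lanes in use
partial = split_into_warps(100)   # 4 warps, the last one only partly filled
where = locate_thread(70)         # thread 70 is lane 6 of warp 2
```

Under these assumptions a 100-thread workgroup still occupies 4 whole warps, which is one reason workgroup sizes are usually chosen as a multiple of the warp size.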

All cores share an L2 cache.